-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added a BallistaContext to ballista to allow for Remote or standalone #1100
Conversation
The main purpose of this Pull Request is to update the Ballista Context and not necessarily to add all the functionality on top of the BallistaContext class found in DataFusion Python. I hope to get it on par with Datafusion Python in follow up MRs, this should be pretty easy because the class is calling DataFusion Python, so I am assuming I should be able to implement a Mixin that would allow me to get it to parity. Any advice on that is appreciated! |
@tbar4 is there any reason to keep BallistaContext and proxy all methods to SeasionContext ? Why we need to duplicate code? Idea is to decrease code we need to maintain. Can't we use SeasionContext directly? |
Please have a look at #1091, the main idea is to utilise PySessionContext directly, same effort we're doing with rust interface |
@milenkovicm updated to be more inline with #1091. In Python, running an init always returns a None, so I just added a |
great stuff @tbar4, two questions, it would be extra mile if we could do it:
from datafusion import SessionContext
# Create a DataFusion context
ctx = SessionContext.standalone() or from datafusion import SessionContext
# Create a DataFusion context
ctx = SessionContext.remove("http:// ...") |
@andygrove your opinion please, would it make sense to rename |
I think that we should stick with the Ballista name for anything in the datafusion-ballista project. btw, I originally chose |
This is a random place to give this feedback, but eventually, I would like to be able to start the executor and scheduler processes from the Python bindings as well. This was possible at one time. |
You mean not just connect, but also be able to start a an actual schedule/executor from Python? I was thinking about this considering you really can autoscale with Python, or are you meaning something different? |
make sense, do we want to take it in this task or capture this as follow up. I was planning to make scheduler and executor process easier to bootstrap |
Personally, I would like to make it a follow up. I'm deploying a Ballista instance as an alternative to Spark for testing at work and I don't see it going very far if users can use Python to interact with it. Merging this will at least give me a starting point and it will allow me to focus more on improvements with some breathing room while users test out Python. But I will am also going to look into making "standalone" an additional feature instead of part of the library today. |
I had a quick look at the PR I think you're overcomplicating it a bit, if we go back to original idea from Ballista import Builder
ctx : SessionContext = Ballista.builder.config("ballista.job.name","Super Cool Ballista Python Job").standalone() you just need struct BallistaBuilder {
confg: HashMap<String,String>
}
impl BallistaBuilder {
fn conf(self, key: String, value: String) -> Self {
self.conf.insert(key, value)
self
}
fn standalone(self) -> PySessionContext {
// create SessionConfig
// create SessionState
// create SessionContext
PySessionContext { ctx }
}
} wdyt @tbar4 ? |
Simple enough, though in a future issue I would like to implement the state and session config to be on par with the Rust implementation options for defining session state and ballista config options |
@milenkovicm I am using the deprecated BallistaContext, but just to define config and then passing the context created from the deprecated BallistaContext to the Python DataFusion SessionContext. I have tested this out on a 3 node cluster (AWS r5.2XLarge, 1 scheduler and 2 executors) and work is being routed to both executors. The current SessionContextExt doesn't take any config options as an argument and instead takes the state which contains a config. In a future PR I can make the switch if necessary, but this PR lets people get up an running with Ballista in Python and utilize all the features of DataFusion Python. |
@tbar4 I'm waiting for this pr to remove BallistaCo text, please don't use it if possible. You can create SessionContext like let session_config = SessionConfig::new_with_ballista()
.with_information_schema(true)
.with_ballista_job_name("Super Cool Ballista App");
let state = SessionStateBuilder::new()
.with_default_features()
.with_config(session_config)
.build();
let ctx: SessionContext = SessionContext::remote_with_state(&url, state).await?; Would that work in your case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tbar4 we're almost there, just few little changes needed IMHO
@milenkovicm @andygrove I am pretty sure this is what we want. We have a way to set config in Python, we are using the new SessionContext to build standalone or remote contexts. Let me know if we want any other changes! Otherwise I'm good for PR approval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're almost there
- Builder is almost there
- Documentation and code need some cleanup
- test is failing as it is using non existing class
@milenkovicm @andygrove ready for you all to take another look. Need a maintainer to resolve the merge conflict with |
I think this makes sense now, we can take further updates at the later date. you'll have to rebase your branch @tbar4 and resolve conflicts. |
@andygrove ready to be merged. @milenkovicm thanks for your patience and help! |
Which issue does this PR close?
Closes #1091
Rationale for this change
This adds a standalone context option when initializing a
BallistaContext
in Ballista PythonWhat changes are included in this PR?
added a context.py file to the python directory and also added new functionality to the BallistaContext file in python directory (context.rs)
Are there any user-facing changes?
This update should be backwards compatible and will default users to a standalone instance, but I think we should discuss if we should go with the original idea of having the Ballista Context initialization in Python for standalone be it's own module.